Smoothing Categorical Data
Abstract
Global models of a dataset reflect not only the large-scale structure of the data distribution but also its smaller-scale structure. Hence, to see the large-scale structure, one should somehow subtract the smaller-scale structure from the model. For some kinds of models, such as boosted classifiers, it is easy to identify the “important” components; for many other kinds of models this is far harder, if possible at all. In such cases one might try an implicit approach: simplify the data distribution without changing its large-scale structure. That is, one might first smooth the local structure out of the dataset and then induce a new model from the smoothed dataset. This new model should then reflect the large-scale structure of the original dataset. In this paper we propose such a smoothing method for categorical data and for one particular type of model, viz., code tables. Experiments show that our approach preserves the large-scale structure of a dataset well: the smoothed dataset is simpler, while the original and smoothed datasets share the same large-scale structure.
Similar resources
Local Smoothing with given Marginals
In models using categorical data one may use adjacency relations to justify smoothing to improve upon simple histogram approximations of the probabilities. This is particularly convenient for sparsely observed or rather peaked distributions. Moreover, in a few models, prior knowledge of a marginal distribution is available. We adapt local polynomial estimators to include this partial informatio...
Analyzing Longitudinal Data Using Gee-Smoothing Spline
This paper considers nonparametric regression to analyze longitudinal data. Some developments of nonparametric regression have been achieved for longitudinal or clustered categorical data. For exponential family distribution, Lin & Carroll [6] considered nonparametric regression for longitudinal data using GEE-Local Polynomial Kernel (LPK). They showed that in order to obtain an efficient estim...
Statistical Notions of Data Disclosure Avoidance and Their Relationship to Traditional Statistical Methodology: Data Swapping and Loglinear Models
For most data releases especially those from censuses, the U. S. Bureau of the Census has either released data at high levels of aggregation or applied a data disclosure avoidance procedure such as data swapping or cell suppression before preparing micro-data or tables for release. In this paper, we present a general statistical characterization of the goal of a statistical agency in releasing ...
Bayesian inference for categorical data analysis
This article surveys Bayesian methods for categorical data analysis, with primary emphasis on contingency table analysis. Early innovations were proposed by Good (1953, 1956, 1965) for smoothing proportions in contingency tables and by Lindley (1964) for inference about odds ratios. These approaches primarily used conjugate beta and Dirichlet priors. Altham (1969, 1971) presented Bayesian analogs...
Prediction of Notes from Vocal Time Series Produced by Singing Voice
Aiming at optimal prediction of the correct note corresponding to a vocal time series we trained a classification algorithm on the basis of parts of interpretations of Tochter Zion (Händel) and tested the algorithm on the remaining parts. As classification algorithm we use a radial basis function support vector machine together with a “Hidden Markov” method as a dynamisation mechanism and some ...
Publication date: 2012